When robots learn reward functions using high-capacity models that take raw state directly as input, they need to learn both a representation for what matters in the task -- the task "features" -- and how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, because we want to learn the representations that people use (so that we can learn their preferences and objectives), we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.
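A minimal sketch of how such similarity queries could shape a feature representation, assuming a contrastive-style loss over user-labeled trajectory pairs (the encoder, names, and loss form are illustrative, not the paper's code):

```python
# Hedged sketch: a feature encoder trained from user similarity queries.
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Maps a raw trajectory (flattened states) to a low-dimensional feature vector."""
    def __init__(self, traj_dim: int, feat_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, traj):
        return self.net(traj)

def similarity_loss(encoder, traj_a, traj_b, similar, margin=1.0):
    """Pull embeddings together when the user says 'similar' (similar = 1),
    push them at least `margin` apart when the user says 'different' (similar = 0)."""
    d = torch.norm(encoder(traj_a) - encoder(traj_b), dim=-1)
    pull = similar * d.pow(2)
    push = (1 - similar) * torch.clamp(margin - d, min=0).pow(2)
    return (pull + push).mean()
```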
Recent work in sim2real has successfully enabled robots to act in physical environments by training in simulation with a diverse "population" of environments (i.e. domain randomization). In this work, we focus on enabling generalization in assistive tasks: tasks in which the robot is acting to assist a user (e.g. helping someone with motor impairments with bathing or with scratching an itch). Such tasks are particularly interesting relative to prior sim2real successes because the environment now contains a human who is also acting. This complicates the problem because the diversity of human users (instead of merely physical environment parameters) is more difficult to capture in a population, thus increasing the likelihood of encountering out-of-distribution (OOD) human policies at test time. We advocate that generalization to such OOD policies benefits from (1) learning a good latent representation for human policies that test-time humans can accurately be mapped to, and (2) making that representation adaptable with test-time interaction data, instead of relying on it to perfectly capture the space of human policies based on the simulated population only. We study how to best learn such a representation by evaluating on purposefully constructed OOD test policies. We find that sim2real methods that encode environment (or population) parameters, and that work well for tasks robots do in isolation, do not work well in assistance. In assistance, it seems crucial to train the representation based on the history of interaction directly, because that is what the robot will have access to at test time. Further, training these representations to then predict human actions not only gives them better structure, but also enables them to be fine-tuned at test-time, when the robot observes the partner act. https://adaptive-caregiver.github.io.
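A hedged sketch of the recipe above: summarize the interaction history into a latent, and give that latent structure (and a test-time fine-tuning signal) by predicting the human's actions. Module names and architecture choices are assumptions for illustration, not the released code:

```python
# Hedged sketch: history-conditioned latent for the human partner,
# trained (and fine-tuned at test time) by predicting human actions.
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    def __init__(self, obs_act_dim: int, latent_dim: int = 16):
        super().__init__()
        self.rnn = nn.GRU(obs_act_dim, latent_dim, batch_first=True)

    def forward(self, history):            # history: (batch, time, obs + act)
        _, h = self.rnn(history)
        return h[-1]                       # latent summary of the partner so far

class HumanActionHead(nn.Module):
    def __init__(self, latent_dim: int, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Linear(latent_dim + obs_dim, act_dim)

    def forward(self, latent, obs):
        return self.net(torch.cat([latent, obs], dim=-1))

def prediction_loss(encoder, head, history, obs, human_action):
    """Supervised loss that structures the latent and can be re-applied
    at test time whenever the robot observes the partner act."""
    return nn.functional.mse_loss(head(encoder(history), obs), human_action)
```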
When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human's behavior. Many existing works choose to fix this coefficient regardless of the type or quality of the human feedback. However, in some settings, giving a demonstration may be much harder than answering a comparison query. In that case, we should expect to see more noise or suboptimality in demonstrations than in comparisons, and should interpret the feedback accordingly. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in experiments with simulated feedback as well as in a user study. We find that when learning from a single feedback type, overestimating human rationality can have dire consequences for reward accuracy and regret. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative -- when human behavior is highly suboptimal, comparisons actually become more informative, even at the same rationality level. Moreover, when the robot gets to decide which feedback type to ask for, it gains a large advantage from accurately modeling the rationality level of each type. Ultimately, our results emphasize the importance of paying attention to the assumed rationality level, not only when learning from a single feedback type, but especially when the agent learns from multiple feedback types.
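To make the noisy-rational model concrete: a comparison between two trajectories is commonly modeled as a Boltzmann choice with rationality coefficient beta, and beta can be fit per feedback type from real human data rather than fixed to a default. The sketch below is illustrative, not the paper's code:

```python
# Hedged sketch: Boltzmann-rational comparison model with a per-feedback-type
# rationality coefficient beta, fit by maximum likelihood on observed choices.
import numpy as np
from scipy.optimize import minimize_scalar

def choice_loglik(beta, reward_a, reward_b, chose_a):
    """Log-likelihood of comparisons under P(A) = exp(b*R_A) / (exp(b*R_A) + exp(b*R_B))."""
    logits = beta * (reward_a - reward_b)
    log_p_a = -np.log1p(np.exp(-logits))       # log sigmoid(logits)
    log_p_b = -np.log1p(np.exp(logits))        # log (1 - sigmoid(logits))
    return np.sum(np.where(chose_a, log_p_a, log_p_b))

def fit_beta(reward_a, reward_b, chose_a):
    """Ground beta in data for this feedback type instead of assuming a default."""
    res = minimize_scalar(lambda b: -choice_loglik(b, reward_a, reward_b, chose_a),
                          bounds=(1e-3, 100.0), method="bounded")
    return res.x
```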
How can we train an assistive human-machine interface (e.g., an electromyography-based limb prosthesis) to translate a user's raw command signals into the actions of a robot or computer when we do not have a prior mapping, cannot ask the user for supervision in the form of action labels or reward feedback, and do not have prior knowledge of the tasks the user is trying to accomplish? The key idea in this paper is that, regardless of the task, when an interface is more intuitive, the user's commands are less noisy. We formalize this idea into a completely unsupervised objective for optimizing interfaces: the mutual information between the user's command signals and the induced state transitions in the environment. To evaluate whether this mutual information score can distinguish effective from ineffective interfaces, we conduct an observational study on 540K examples of users operating various keyboard and eye-gaze interfaces for typing, controlling simulated robots, and playing video games. The results show that our mutual information score is predictive of the ground-truth task completion metrics in a variety of domains, with an average Spearman's rank correlation of 0.43. In addition to offline evaluation of existing interfaces, we use our unsupervised objective to learn an interface from scratch: we randomly initialize the interface, have the user attempt to perform their desired tasks using the interface, measure the mutual information score, and update the interface through reinforcement learning to maximize mutual information. We evaluate our method through a user study with 12 participants who perform a 2D cursor control task with a perturbed mouse, and an experiment in which one user plays the Lunar Lander game using hand gestures. The results show that we can learn an interface from scratch, without any user supervision or prior knowledge of the task, within 30 minutes.
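A rough sketch of the kind of score described above, using a simple plug-in estimate of the mutual information between discretized commands and state transitions (the estimator and binning are assumptions, not necessarily the paper's):

```python
# Hedged sketch: plug-in mutual-information score for an interface,
# computed from logged command signals and the state transitions they induced.
import numpy as np
from sklearn.metrics import mutual_info_score

def interface_score(commands, transitions, n_bins=10):
    """commands: (N,) or (N, d) raw command signals;
    transitions: (N,) or (N, d') induced state changes, e.g. s_{t+1} - s_t."""
    def labels(x):
        x = np.asarray(x).reshape(len(x), -1)                  # (N, d)
        binned = [np.digitize(c, np.histogram_bin_edges(c, n_bins)[1:-1]) for c in x.T]
        # collapse the per-dimension bin indices into one discrete label per sample
        return np.ravel_multi_index(binned, [n_bins] * len(binned))
    return mutual_info_score(labels(commands), labels(transitions))
```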
Our goal is to enable robots to perform functional tasks in emotionally expressive ways, whether responding to the user's emotional state or conveying their own level of confidence. Prior work has proposed learning an independent cost function from user feedback for each target emotion, so that the robot can optimize it alongside task- and environment-specific objectives in whatever situation it encounters. However, this approach is inefficient when modeling multiple emotions and cannot generalize to new ones. In this work, we leverage the fact that emotions are not independent of one another: they are related through a latent space of valence, arousal, and dominance (VAD). Our key idea is to learn a model that maps trajectories to VAD using user labels. Treating the distance between a trajectory's mapping and a target VAD as a cost, this single model can represent cost functions for all emotions. As a result, 1) all user feedback can contribute to learning about every emotion; 2) the robot can generate trajectories for any emotion in the space, not just a few predefined ones; and 3) the robot can respond emotively to user-generated natural language by mapping it to a target VAD. We introduce a method that interactively learns to map trajectories to this latent space and test it in simulation and in a user study. In our experiments, we use a simple vacuum robot as well as the Cassie biped.
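A minimal sketch, assuming a learned mapping from trajectories to VAD: one model yields a cost for any target emotion as a distance in the latent space (names and architecture are illustrative):

```python
# Hedged sketch: one trajectory-to-VAD model serving as the cost for every emotion.
import torch
import torch.nn as nn

class TrajToVAD(nn.Module):
    def __init__(self, traj_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, traj):
        return self.net(traj)              # predicted (valence, arousal, dominance)

def emotive_cost(model, traj, target_vad):
    """Cost of a candidate trajectory for *any* target emotion: distance in VAD space.
    The robot can optimize this alongside its task- and environment-specific objectives."""
    return torch.norm(model(traj) - target_vad, dim=-1)
```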
Real-world robot tasks require complex reward functions. When we define the problem the robot needs to solve, we pretend that the designer specifies this complex reward exactly, and that from then on it is set in stone. In practice, however, reward design is an iterative process: the designer chooses a reward, eventually encounters an "edge-case" environment where the reward incentivizes the wrong behavior, revises the reward, and repeats. What would it mean to rethink the robotics problem to formally account for this iterative nature of reward design? We propose that the robot not take the specified reward at face value, but instead maintain uncertainty about it and treat future design iterations as future evidence. We contribute an assisted reward design method that speeds up the design process by anticipating and influencing this future evidence: rather than letting the designer eventually encounter failure cases and revise the reward, the method proactively exposes the designer to such environments during the development phase. We test this method in a simplified autonomous driving task and find that it improves the car's behavior faster by proposing environments that are "edge cases" for the current reward.
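One hedged way to operationalize "proactively exposing the designer to edge cases": keep samples from the robot's belief over the true reward and propose the environment where those samples disagree most about the behavior the current proxy reward produces. The helpers `plan` and `features` are assumed, not the paper's API:

```python
# Hedged sketch: score candidate environments by how much reward hypotheses
# disagree about the trajectory the current proxy reward would produce there.
import numpy as np

def propose_edge_case(environments, proxy_w, reward_samples, plan, features):
    """environments: candidate envs; proxy_w: current designer reward weights (d,);
    reward_samples: (K, d) samples from the belief over the true weights."""
    def disagreement(env):
        traj = plan(env, proxy_w)          # behavior the current reward incentivizes
        phi = features(traj)               # feature counts of that behavior, shape (d,)
        returns = reward_samples @ phi     # how each reward hypothesis scores it
        return np.var(returns)             # high variance = likely edge case
    return max(environments, key=disagreement)
```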
Humans have internal models of robots (like their physical capabilities), the world (like what will happen next), and their tasks (like a preferred goal). However, human internal models are not always perfect: for example, it is easy to underestimate a robot's inertia. Nevertheless, these models change and improve over time as humans gather more experience. Interestingly, robot actions influence what this experience is, and therefore influence how people's internal models change. In this work we take a step towards enabling robots to understand the influence they have, leverage it to better assist people, and help human models more quickly align with reality. Our key idea is to model the human's learning as a nonlinear dynamical system which evolves the human's internal model given new observations. We formulate a novel optimization problem to infer the human's learning dynamics from demonstrations that naturally exhibit human learning. We then formalize how robots can influence human learning by embedding the human's learning dynamics model into the robot planning problem. Although our formulations provide concrete problem statements, they are intractable to solve in full generality. We contribute an approximation that sacrifices the complexity of the human internal models we can represent, but enables robots to learn the nonlinear dynamics of these internal models. We evaluate our inference and planning methods in a suite of simulated environments and an in-person user study, where a 7DOF robotic arm teaches participants to be better teleoperators. While influencing human learning remains an open problem, our results demonstrate that this influence is possible and can be helpful in real human-robot interaction.
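A hedged sketch of the modeling idea: treat the human's internal-model parameters as a state that evolves with each observation, and let the robot roll that learning forward when evaluating a candidate plan. The residual-update form is an assumption for illustration, not the paper's exact parameterization:

```python
# Hedged sketch: learned nonlinear dynamics of the human's internal model,
# rolled forward under a candidate robot plan.
import torch
import torch.nn as nn

class HumanLearningDynamics(nn.Module):
    """theta_{t+1} = theta_t + g(theta_t, observation), with g a learned nonlinear map."""
    def __init__(self, theta_dim: int, obs_dim: int):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(theta_dim + obs_dim, 64), nn.Tanh(),
                               nn.Linear(64, theta_dim))

    def forward(self, theta, obs):
        return theta + self.g(torch.cat([theta, obs], dim=-1))

def rollout_human_model(dynamics, theta0, planned_observations):
    """Predict how the human's internal model changes under a candidate robot plan,
    which is what lets the robot choose actions that teach."""
    theta, beliefs = theta0, [theta0]
    for obs in planned_observations:
        theta = dynamics(theta, obs)
        beliefs.append(theta)
    return beliefs
```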
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This begs the question: how accurate do these models need to be in order for the reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic error in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if as our models improve, we can have a guarantee that reward accuracy also improves, this would show the benefit of more work on the modeling side. We study this question both theoretically and empirically. We do show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we are also able to identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.
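The flavor of guarantee described above can be written, under suitable identifiability assumptions, roughly as follows (a paraphrase, not the paper's exact statement or constants):

```latex
% Hedged paraphrase of the kind of bound described above: if the assumed human
% policy \hat{\pi} is close to the true policy \pi_H, the inferred reward
% parameters are correspondingly close to the true ones.
\[
  \|\theta^{*} - \hat{\theta}\|
  \;\le\;
  C \, \sup_{s} \, d\!\big(\pi_H(\cdot \mid s),\, \hat{\pi}(\cdot \mid s)\big),
\]
% where \theta^{*} are the true reward parameters, \hat{\theta} those inferred
% under the misspecified model \hat{\pi}, d is a distance between action
% distributions, and C depends on the assumptions that make the reward identifiable.
```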
One of the most successful paradigms for reward learning uses human feedback in the form of comparisons. Although these methods hold promise, human comparison labeling is expensive and time consuming, constituting a major bottleneck to their broader applicability. Our insight is that we can greatly improve how effectively human time is used in these approaches by batching comparisons together, rather than having the human label each comparison individually. To do so, we leverage data dimensionality-reduction and visualization techniques to provide the human with an interactive GUI displaying the state space, in which the user can label sub-portions of the state space. Across several simple MuJoCo tasks, we show that this high-level approach holds promise and is able to greatly increase the performance of the resulting agents, given the same amount of human labeling time.
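A hedged sketch of the batching idea: project states to 2D so the user can rank whole regions at once, then expand pairs of differently-ranked regions into many comparison labels (the GUI itself is not shown; region labels are assumed given, and all names are illustrative):

```python
# Hedged sketch: turn a handful of region-level labels into many pairwise comparisons.
import numpy as np
from sklearn.decomposition import PCA

def batched_comparisons(states, region_of, region_rank):
    """states: (N, d) array; region_of(point_2d) -> region id drawn in the GUI;
    region_rank: dict mapping region id -> user-assigned preference score."""
    xy = PCA(n_components=2).fit_transform(states)
    labels = [region_of(p) for p in xy]
    comparisons = []
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            ri, rj = region_rank.get(labels[i]), region_rank.get(labels[j])
            if ri is not None and rj is not None and ri != rj:
                comparisons.append((i, j, int(ri > rj)))   # 1 if state i preferred
    return comparisons
```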
Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the FlexiBiT framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single FlexiBiT model is simultaneously capable of carrying out many tasks with performance similar to or better than specialized models. Additionally, we show that performance can be further improved by fine-tuning our general model on specific tasks of interest.
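A hedged illustration of how the tasks listed above correspond to different maskings of a (state, action) sequence; the mask conventions below are ours, not the exact FlexiBiT implementation (1 = visible to the model, 0 = masked and to be reconstructed):

```python
# Hedged sketch: task-specific masks over a length-T sequence of states and actions.
import numpy as np

def make_masks(T):
    return np.ones(T, dtype=int), np.ones(T, dtype=int)   # (state_mask, action_mask)

def behavior_cloning_mask(T):
    s, a = make_masks(T)
    a[:] = 0                              # see the states, predict every action
    return s, a

def inverse_dynamics_mask(T, t):
    s, a = make_masks(T)
    s[:t], s[t + 2:] = 0, 0               # only s_t and s_{t+1} are visible
    a[:] = 0                              # the model fills in a_t from that pair
    return s, a

def waypoint_conditioning_mask(T, goal_t):
    s, a = make_masks(T)
    s[1:goal_t], s[goal_t + 1:] = 0, 0    # current state and a future waypoint visible
    a[:] = 0                              # predict the actions in between
    return s, a

def random_pretraining_mask(T, p=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(T) > p).astype(int), (rng.random(T) > p).astype(int)
```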